Abstract

Most tasks in natural language processing can be cast into question
answering (QA) problems over language input. We introduce the dynamic
memory network (DMN), a neural network architecture which processes
input sequences and questions, forms episodic memories, and generates
relevant answers. Questions trigger an iterative attention process
which allows the model to condition its attention on the inputs and the
result of previous iterations. These results are then reasoned over in
a hierarchical recurrent sequence model to generate answers. The DMN
can be trained end-to-end and obtains state-of-the-art results on
several types of tasks and datasets: question answering (Facebook’s
bAbI dataset), text classification for sentiment analysis (Stanford
Sentiment Treebank) and sequence modeling for part-of-speech tagging
(WSJ-PTB). The training for these different tasks relies exclusively on
trained word vector representations and input-question-answer triplets.


1. Introduction 


Question answering (QA) is a complex natural language processing task
which requires an understanding of the meaning of a text and the
ability to reason over relevant facts. Most, if not all, tasks in
natural language processing can be cast as a question answering
problem: high level tasks like machine translation (What is the
translation into French?); sequence modeling tasks like named entity
recognition (Passos et al., 2014) (NER) (What are the named entity tags
in this sentence?) or part-of-speech tagging (POS) (What are the
part-of-speech tags?); classification problems like sentiment analysis
(Socher et al., 2013) (What is the sentiment?); even multi-sentence
joint classification problems like coreference resolution (Who does
"their" refer to?).

I: Jane went to the hallway.
I: Mary walked to the bathroom.
I: Sandra went to the garden.
I: Daniel went back to the garden.
I: Sandra took the milk there.
Q: Where is the milk?
A: garden

I: It started boring, but then it got interesting.
Q: What’s the sentiment?
A: positive
Q: POS tags?
A: PRP VBD JJ , CC RB PRP VBD JJ .

Figure 1. Example inputs and questions, together with answers
generated by a dynamic memory network trained on the corresponding
task. In sequence modeling tasks, an answer mechanism is triggered at
each input word instead of only at the end.

We propose the Dynamic Memory Network (DMN), a neural network based
framework for general question answering tasks that is trained using
raw input-question-answer triplets. Generally, it can solve sequence
tagging tasks, classification problems, sequence-to-sequence tasks and
question answering tasks that require transitive reasoning.

The DMN first computes a representation for all inputs and the
question. The question representation then triggers an iterative
attention process that searches the inputs and retrieves relevant
facts. The DMN memory module then reasons over retrieved facts and
provides a vector representation of all relevant information to an
answer module which generates the answer.

Fig. 1 provides examples of inputs, questions and answers for tasks
that are evaluated in this paper and for which a DMN achieves a new
level of state-of-the-art performance.
2. Dynamic Memory Networks

We now give an overview of the modules that make up the DMN. We then
examine each module in detail and give intuitions about its
formulation. A high-level illustration of the DMN is shown in Fig. 2.

Input Module: The input module encodes raw text inputs from the task
into distributed vector representations. In this paper, we focus on
natural language related problems. In these cases, the input may be a
sentence, a long story, a movie review, a news article, or several
Wikipedia articles.

Question Module: Like the input module, the question module encodes
the question of the task into a distributed vector representation. For
example, in the case of question answering, the question may be a
sentence such as "Where did the author first fly?". The representation
is fed into the episodic memory module, and forms the basis, or initial
state, upon which the episodic memory module iterates.

Episodic Memory Module: Given a collection of input representations,
the episodic memory module chooses which parts of the inputs to focus
on through the attention mechanism. It then produces a “memory” vector
representation taking into account the question as well as the previous
memory. Each iteration provides the module with newly relevant
information about the input. In other words, the module has the ability
to retrieve new information, in the form of input representations,
which were thought to be irrelevant in previous iterations.

Answer Module: The answer module generates an answer from the final
memory vector of the memory module.

A detailed visualization of these modules is shown in Fig. 3.


2.1. Input Module 


In natural language processing problems, the input is a sequence of
$T_I$ words $w_1, \ldots, w_{T_I}$. One way to encode the input
sequence is via a recurrent neural network (Elman, 1991). Word
embeddings are given as inputs to the recurrent network. At each time
step $t$, the network updates its hidden state
$h_t = RNN(L[w_t], h_{t-1})$, where $L$ is the embedding matrix and
$w_t$ is the word index of the $t$-th word of the input sequence.


Figure 2. Overview of DMN modules. Communication between them is
indicated by arrows and uses vector representations. Questions trigger
gates which allow vectors for certain inputs to be given to the
episodic memory module. The final state of the episodic memory is the
input to the answer module.

In cases where the input sequence is a single sentence, the input
module outputs the hidden states of the recurrent network. In cases
where the input sequence is a list of sentences, we concatenate the
sentences into a long list of word tokens, inserting after each
sentence an end-of-sentence token. The hidden states at each of the
end-of-sentence tokens are then the final representations of the input
module. In subsequent sections, we denote the output of the input
module as the sequence of $T_C$ fact representations $c$, whereby $c_t$
denotes the $t$-th element in the output sequence of the input module.
Note that in the case where the input is a single sentence,
$T_C = T_I$. That is, the number of output representations is equal to
the number of words in the sentence. In the case where the input is a
list of sentences, $T_C$ is equal to the number of sentences.
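As a minimal sketch of the list-of-sentences case described above, the following Python snippet concatenates sentences with an end-of-sentence token and keeps only the encoder states at those positions. The token string `<eos>`, the helper names, and the stand-in hidden states (one number per position instead of real recurrent-network outputs) are assumptions made purely for illustration.

```python
def concatenate_sentences(sentences, eos_token="<eos>"):
    """Flatten a list of tokenized sentences, appending eos_token after each one."""
    tokens = []
    for sentence in sentences:
        tokens.extend(sentence)
        tokens.append(eos_token)
    return tokens


def select_fact_states(tokens, hidden_states, eos_token="<eos>"):
    """Keep the hidden states located at end-of-sentence positions."""
    if len(tokens) != len(hidden_states):
        raise ValueError("expected exactly one hidden state per token")
    return [h for tok, h in zip(tokens, hidden_states) if tok == eos_token]


sentences = [
    ["Jane", "went", "to", "the", "hallway"],
    ["Sandra", "took", "the", "milk"],
]
tokens = concatenate_sentences(sentences)

# Stand-in for encoder outputs: each "hidden state" just records its position.
hidden_states = [[float(i)] for i in range(len(tokens))]

facts = select_fact_states(tokens, hidden_states)
# One fact representation per sentence, so T_C equals the number of sentences.
```

With real encoder outputs in place of the stand-in states, `facts` would play the role of the sequence $c_1, \ldots, c_{T_C}$.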


Choice of recurrent network: In our experiments, we use a gated
recurrent network (GRU) (Cho et al., 2014a; Chung et al., 2014). We
also explored the more complex LSTM (Hochreiter & Schmidhuber, 1997)
but it performed similarly and is more computationally expensive. Both
work much better than the standard tanh RNN and we postulate that the
main strength comes from having gates that allow the model to suffer
less from the vanishing gradient problem (Hochreiter & Schmidhuber,
1997). Assume each time step $t$ has an input $x_t$ and a hidden state
$h_t$. The internal mechanics of the GRU are defined as:

$$z_t = \sigma\left(W^{(z)} x_t + U^{(z)} h_{t-1} + b^{(z)}\right) \qquad (1)$$

$$r_t = \sigma\left(W^{(r)} x_t + U^{(r)} h_{t-1} + b^{(r)}\right) \qquad (2)$$

$$\tilde{h}_t = \tanh\left(W x_t + r_t \circ U h_{t-1} + b^{(h)}\right) \qquad (3)$$

$$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t \qquad (4)$$

where $\circ$ is an element-wise product,
$W^{(z)}, W^{(r)}, W \in \mathbb{R}^{n_H \times n_I}$ and
$U^{(z)}, U^{(r)}, U \in \mathbb{R}^{n_H \times n_H}$. The dimensions
$n$ are hyperparameters. We abbreviate the above computation with
$h_t = GRU(x_t, h_{t-1})$.
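The GRU update can be sketched in a few lines of plain Python. This is an illustrative sketch only: it assumes the standard GRU form with an update gate, a reset gate and a bias term in each equation, and the parameter dictionary keys (`Wz`, `Uz`, `bz`, and so on) and the helper `zero_params` are hypothetical names chosen for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, p):
    """One GRU update: gates z and r, candidate state, then interpolation."""
    n = len(h_prev)
    wz_x, uz_h = matvec(p["Wz"], x), matvec(p["Uz"], h_prev)
    wr_x, ur_h = matvec(p["Wr"], x), matvec(p["Ur"], h_prev)
    w_x, u_h = matvec(p["W"], x), matvec(p["U"], h_prev)
    z = [sigmoid(wz_x[k] + uz_h[k] + p["bz"][k]) for k in range(n)]
    r = [sigmoid(wr_x[k] + ur_h[k] + p["br"][k]) for k in range(n)]
    h_tilde = [math.tanh(w_x[k] + r[k] * u_h[k] + p["bh"][k]) for k in range(n)]
    # z close to 1 keeps the old state, z close to 0 takes the candidate state.
    return [z[k] * h_prev[k] + (1.0 - z[k]) * h_tilde[k] for k in range(n)]


def zero_params(n_h, n_i):
    """All-zero parameters, useful only for checking shapes and arithmetic."""
    def zeros(rows, cols):
        return [[0.0] * cols for _ in range(rows)]
    return {
        "Wz": zeros(n_h, n_i), "Uz": zeros(n_h, n_h), "bz": [0.0] * n_h,
        "Wr": zeros(n_h, n_i), "Ur": zeros(n_h, n_h), "br": [0.0] * n_h,
        "W": zeros(n_h, n_i), "U": zeros(n_h, n_h), "bh": [0.0] * n_h,
    }


# With zero weights both gates equal 0.5 and the candidate state is 0,
# so the new state is half of the previous state.
h = gru_step([1.0, 2.0, 3.0], [1.0, -2.0], zero_params(2, 3))
```

In practice the parameters would be learned and the loop over time steps would call `gru_step` once per word embedding.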

2.2. Question Module 

Similar to the input sequence, the question is also most commonly
given as a sequence of words in natural language processing problems.
As before, we encode the question via a recurrent neural network.
Given a question of $T_Q$ words, the hidden state for the question
encoder at time $t$ is given by $q_t = GRU(L[w_t^Q], q_{t-1})$, where
$L$ represents the word embedding matrix as in the previous section and
$w_t^Q$ represents the word index of the $t$-th word in the question.
We share the word embedding matrix across the input module and the
question module. Unlike the input module, the question module produces
as output the final hidden state of the recurrent network encoder:
$q = q_{T_Q}$.

Figure 3. Real example of an input list of sentences and the attention
gates that are triggered by a specific question from the bAbI tasks
(Weston et al., 2015a). Gate values $g_t^i$ are shown above the
corresponding vectors. The gates change with each search over inputs.
We do not draw connections for gates that are close to zero. Note that
the second iteration has wrongly placed some weight in sentence 2,
which makes some intuitive sense, as sentence 2 is another place John
had been.

2.3. Episodic Memory Module 

The episodic memory module iterates over representations outputted by
the input module, while updating its internal episodic memory. In its
general form, the episodic memory module is comprised of an attention
mechanism as well as a recurrent network with which it updates its
memory. During each iteration, the attention mechanism attends over the
fact representations $c$ while taking into consideration the question
representation $q$ and the previous memory $m^{i-1}$ to produce an
episode $e^i$.

The episode is then used, alongside the previous memory $m^{i-1}$, to
update the episodic memory $m^i = GRU(e^i, m^{i-1})$. The initial state
of this GRU is initialized to the question vector itself: $m^0 = q$.
For some tasks, it is beneficial for the episodic memory module to take
multiple passes over the input. After $T_M$ passes, the final memory
$m^{T_M}$ is given to the answer module.
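The outer loop over passes can be sketched as follows. This is only a structural illustration: `compute_episode` and `update_memory` are hypothetical callables standing in for the attention-based episode computation and the memory GRU, and the two toy stand-ins below (pick the fact most similar to the memory, then average) are assumptions chosen to keep the example self-contained, not the model's actual components.

```python
def run_episodic_memory(facts, q, compute_episode, update_memory, num_passes):
    """Start from m^0 = q and apply num_passes episode/memory updates."""
    memory = list(q)
    for _ in range(num_passes):
        episode = compute_episode(facts, memory, q)
        memory = update_memory(episode, memory)
    return memory


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_episode(facts, memory, q):
    """Toy stand-in: return the fact with the largest dot product with the memory."""
    return max(facts, key=lambda fact: dot(fact, memory))


def toy_update(episode, memory):
    """Toy stand-in for the memory GRU: element-wise average."""
    return [(e + m) / 2.0 for e, m in zip(episode, memory)]


facts = [[1.0, 1.0], [0.0, 1.0]]
question = [1.0, 0.0]
final_memory = run_episodic_memory(facts, question, toy_episode, toy_update, num_passes=2)
```

Because the memory changes after every pass, the episode selected in a later pass can depend on what was retrieved earlier, which is the property the iterative design relies on.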


Need for Multiple Episodes: The iterative nature of this module allows
it to attend to different inputs during each pass. It also allows for a
type of transitive inference, since the first pass may uncover the need
to retrieve additional facts. For instance, in the example in Fig. 3,
we are asked "Where is the football?" In the first iteration, the model
ought to attend to sentence 7 (John put down the football.), as the
question asks about the football. Only once the model sees that John is
relevant can it reason that the second iteration should retrieve where
John was. Similarly, a second pass may help for sentiment analysis as
we show in the experiments section below.

Attention Mechanism: In our work, we use a gating function as our
attention mechanism. For each pass $i$, the mechanism takes as input a
candidate fact $c_t$, a previous memory $m^{i-1}$, and the question $q$
to compute a gate: $g_t^i = G(c_t, m^{i-1}, q)$.

The scoring function $G$ takes as input the feature set $z(c, m, q)$
and produces a scalar score. We first define a large feature vector
that captures a variety of similarities between input, memory and
question vectors:

$$z(c, m, q) = \left[c,\, m,\, q,\, c \circ q,\, c \circ m,\, |c - q|,\, |c - m|,\, c^T W^{(b)} q,\, c^T W^{(b)} m\right] \qquad (5)$$

where $\circ$ is the element-wise product. The function $G$ is a simple
two-layer feed forward neural network

$$G(c, m, q) = \sigma\left(W^{(2)} \tanh\left(W^{(1)} z(c, m, q) + b^{(1)}\right) + b^{(2)}\right). \qquad (6)$$
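A small sketch of the gate computation may help make the shapes concrete. It assumes the feature vector consists of the seven element-wise groups followed by two bilinear similarity scalars, and that the scoring network has one tanh hidden layer and a sigmoid output; the parameter names (`Wb`, `W1`, `b1`, `W2`, `b2`) and the hidden size of 3 are hypothetical choices for this example.

```python
import math


def bilinear(c, wb, v):
    """Scalar similarity c^T Wb v."""
    return sum(c[i] * wb[i][j] * v[j] for i in range(len(c)) for j in range(len(v)))


def feature_vector(c, m, q, wb):
    """Concatenate the similarity features between fact c, memory m and question q."""
    feats = list(c) + list(m) + list(q)
    feats += [ci * qi for ci, qi in zip(c, q)]       # c o q
    feats += [ci * mi for ci, mi in zip(c, m)]       # c o m
    feats += [abs(ci - qi) for ci, qi in zip(c, q)]  # |c - q|
    feats += [abs(ci - mi) for ci, mi in zip(c, m)]  # |c - m|
    feats += [bilinear(c, wb, q), bilinear(c, wb, m)]
    return feats


def gate(c, m, q, params):
    """Two-layer scoring network: tanh hidden layer, sigmoid output in (0, 1)."""
    z = feature_vector(c, m, q, params["Wb"])
    hidden = [
        math.tanh(sum(w * zi for w, zi in zip(row, z)) + b)
        for row, b in zip(params["W1"], params["b1"])
    ]
    score = sum(w * h for w, h in zip(params["W2"], hidden)) + params["b2"]
    return 1.0 / (1.0 + math.exp(-score))


c, m, q = [1.0, 2.0], [0.0, 1.0], [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
z = feature_vector(c, m, q, identity)  # length 7 * d + 2 for d-dimensional vectors

params = {
    "Wb": identity,
    "W1": [[0.0] * len(z) for _ in range(3)],
    "b1": [0.0] * 3,
    "W2": [0.0] * 3,
    "b2": 0.0,
}
g = gate(c, m, q, params)  # untrained all-zero weights give a neutral gate of 0.5
```

The gate is computed once per fact and per pass, so a batch implementation would evaluate `gate` for all $c_t$ at the same time.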


Some datasets, such as Facebook’s bAbI dataset, specify which facts
are important for a given question. In those cases, the attention
mechanism of the $G$ function can be trained in a supervised fashion
with a standard cross-entropy cost function.

Memory Update Mechanism: To compute the episode for pass $i$, we employ
a modified GRU over the sequence of the inputs $c_1, \ldots, c_{T_C}$,
weighted by the gates $g^i$. The episode vector that is given to the
answer module is the final state of the GRU. The equation to update the
hidden states of the GRU at time $t$ and the equation to compute the
episode are, respectively:

$$h_t^i = g_t^i \, GRU(c_t, h_{t-1}^i) + (1 - g_t^i) \, h_{t-1}^i \qquad (7)$$

$$e^i = h_{T_C}^i \qquad (8)$$

Criteria for Stopping: The episodic memory module also has a signal to
stop iterating over inputs. To achieve this, we append a special
end-of-passes representation to the input, and stop the iterative
attention process if this representation is chosen by the gate
function. For datasets without explicit supervision, we set a maximum
number of iterations. The whole module is end-to-end differentiable.

2.4. Answer Module

The answer module generates an answer given a vector. Depending on the
type of task, the answer module is either triggered once at the end of
the episodic memory or at each time step.

We employ another GRU whose initial state is initialized to the last
memory $a_0 = m^{T_M}$. At each timestep, it takes as input the
question $q$, last hidden state $a_{t-1}$, as well as the previously
predicted output $y_{t-1}$.

$$y_t = \mathrm{softmax}\left(W^{(a)} a_t\right) \qquad (9)$$

$$a_t = GRU\left([y_{t-1}, q], a_{t-1}\right), \qquad (10)$$

where we concatenate the last generated word and the question vector as
the input at each time step. The output is trained with the
cross-entropy error classification of the correct sequence appended
with a special end-of-sequence token.

In the sequence modeling task, we wish to label each word in the
original sequence. To this end, the DMN is run in the same way as above
over the input words. For word $t$, we replace Eq. 8 with
$e^i = h_t^i$. Note that the gates for the first pass will be the same
for each word, as the question is the same. This allows for speed-up in
implementation by computing these gates only once. However, gates for
subsequent passes will be different, as the episodes are different.

2.5. Training

Training is cast as a supervised classification problem to minimize
cross-entropy error of the answer sequence. For datasets with gate
supervision, such as bAbI, we add the cross-entropy error of the gates
into the overall cost. Because all modules communicate over vector
representations and various types of differentiable and deep neural
networks with gates, the entire DMN model can be trained via
backpropagation and gradient descent.

3. Related Work

Given the many shoulders on which this paper is standing and the many
applications to which our model is applied, it is impossible to do
related fields justice.


Deep Learning: There are several deep learning models that have been
applied to many different tasks in NLP. For instance, recursive neural
networks have been used for parsing (Socher et al., 2011), sentiment
analysis (Socher et al., 2013), paraphrase detection (Socher et al.,
2011), question answering (Iyyer et al., 2014) and logical inference
(Bowman et al., 2014), among other tasks. However, because they lack
the memory and question modules, a single model cannot solve as many
varied tasks, nor tasks that require transitive reasoning over multiple
sentences. Another commonly used model is the chain-structured
recurrent neural network of the kind we employ above. Recurrent neural
networks have been successfully used in language modeling (Mikolov &
Zweig, 2012), speech recognition, and sentence generation from images
(Karpathy & Fei-Fei, 2015). Also relevant is the sequence-to-sequence
model used for machine translation by Sutskever et al. (2014). This
model uses two extremely large and deep LSTMs to encode a sentence in
one language and then decode the sentence in another language. This
sequence-to-sequence model is a special case of the DMN without a
question and without episodic memory. Instead it maps an input sequence
directly to an answer sequence.

Attention and Memory: The second line of work that is very relevant to
DMNs is that of attention and memory in deep learning. Attention
mechanisms are generally useful and can improve image classification
(Stollenga & J. Masci, 2014), automatic image captioning (Xu et al.,
2015) and machine translation (Cho et al., 2014b; Bahdanau et al.,
2014). Neural Turing machines use memory to solve algorithmic problems
such as list sorting (Graves et al., 2014). The work of recent months
by Weston et al. on memory networks (Weston et al., 2015b) focuses on
adding a memory component for natural language question answering. They
have an input (I) and response (R) component and their generalization
(G) and output feature map (O) components have some functional overlap
with our episodic memory. However, the Memory Network cannot be applied
to the same variety of NLP tasks since it processes sentences
independently and not via a sequence model. It requires bag of n-gram
vector features as well as a separate feature that captures whether a
sentence came before another one.

Various other neural memory or attention architectures have recently
been proposed for algorithmic problems (Joulin & Mikolov, 2015; Kaiser
& Sutskever, 2015), caption generation for images (Malinowski & Fritz,
2014; Chen & Zitnick, 2014), visual question answering (Yang et al.,
2015) or other NLP problems and datasets (Hermann et al., 2015).

In contrast, the DMN employs neural sequence models for input
representation, attention, and response mechanisms, thereby naturally
capturing position and temporality. As a result, the DMN is directly
applicable to a broader range of applications without feature
engineering. We compare directly to Memory Networks on the bAbI dataset
(Weston et al., 2015a).

NLP Applications: The DMN is a general model which we apply to several
NLP problems. We compare to what, to the best of our knowledge, is the
current state-of-the-art method for each task.

There are many different approaches to question answering: some build
large knowledge bases (KBs) with open information extraction systems
(Yates et al., 2007), some use neural networks, dependency trees and
KBs (Bordes et al., 2012), others only sentences (Iyyer et al., 2014).
A lot of other approaches exist. When QA systems do not produce the
right answer, it is often unclear if it is because they do not have
access to the facts, cannot reason over them or have never seen this
type of question or phenomenon. Most QA datasets only have a few
hundred questions and answers but require complex reasoning. They can
hence not be solved by models that have to learn purely from examples.
While synthetic datasets (Weston et al., 2015a) have problems and can
often be solved easily with manual feature engineering, they let us
disentangle failure modes of models and understand necessary QA
capabilities. They are useful for analyzing models that attempt to
learn everything and do not rely on external features like coreference,
POS, parsing, logical rules, etc. The DMN is such a model. Another
related model by Andreas et al. (2016) combines neural and logical
reasoning for question answering over knowledge bases and visual
question answering.

Sentiment analysis is a very useful classification task and recently
the Stanford Sentiment Treebank (Socher et al., 2013) has become a
standard benchmark dataset. Kim (2014) reports the previous
state-of-the-art result based on a convolutional neural network that
uses multiple word vector representations. The previous best model for
part-of-speech tagging on the Wall Street Journal section of the Penn
Tree Bank (Marcus et al., 1993) was Søgaard (2011), who used a
semisupervised nearest neighbor approach. We also directly compare to
paragraph vectors by Le & Mikolov (2014).

Neuroscience: The episodic memory in humans stores specific
experiences in their spatial and temporal context. For instance, it
might contain the first memory somebody has of flying a hang glider.
Eichenbaum and Cohen have argued that episodic memories represent a
form of relationship (i.e., relations between spatial, sensory and
temporal information) and that the hippocampus is responsible for
general relational learning (Eichenbaum & Cohen, 2004). Interestingly,
it also appears that the hippocampus is active during transitive
inference (Heckers et al., 2004), and disruption of the hippocampus
impairs this ability (Dusek & Eichenbaum, 1997).

The episodic memory module in the DMN is inspired by these findings. It
retrieves specific temporal states that are related to or triggered by
a question. Furthermore, we found that the GRU in this module was able
to do some transitive inference over the simple facts in the bAbI
dataset. This module also has similarities to the Temporal Context
Model (Howard & Kahana, 2002) and its Bayesian extensions (Socher et
al., 2009) which were developed to analyze human behavior in word
recall experiments.

4. Experiments 

We include experiments on question answering, part-of-speech tagging,
and sentiment analysis. The model is trained independently for each
problem, while the architecture remains the same except for the answer
module and input fact subsampling (words vs sentences). The answer
module, as described in Section 2.4, is triggered either once at the
end or for each token.

For all datasets we used either the official train, development, test
splits or, if no development set was defined, we used 10% of the
training set for development. Hyperparameter tuning and model selection
(with early stopping) are done on the development set. The DMN is
trained via backpropagation and Adam (Kingma & Ba, 2014). We employ
$L_2$ regularization, and dropout on the word embeddings. Word vectors
are pre-trained using GloVe (Pennington et al., 2014).
Task                           MemNN    DMN
1: Single Supporting Fact      100      100
2: Two Supporting Facts        100      98.2
3: Three Supporting Facts      100      95.2
4: Two Argument Relations      100      100
5: Three Argument Relations    98       99.3
6: Yes/No Questions            100      100
7: Counting                    85       96.9
8: Lists/Sets                  91       96.5
9: Simple Negation             100      100
10: Indefinite Knowledge       98       97.5
11: Basic Coreference          100      99.9
12: Conjunction                100      100
13: Compound Coreference       100      99.8
14: Time Reasoning             99       100
15: Basic Deduction            100      100
16: Basic Induction            100      99.4
17: Positional Reasoning       65       59.6
18: Size Reasoning             95       95.3
19: Path Binding               36       34.5
20: Agent’s Motivations        100      100
Mean Accuracy (%)              93.3     93.6

Table 1. Test accuracies on the bAbI dataset. MemNN numbers taken from
Weston et al. (2015a). The DMN passes (accuracy > 95%) 18 tasks,
whereas the MemNN passes 16.
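The summary figures of Table 1 can be checked with a short script. The snippet below is only a sanity check on the tabulated numbers; the helper names are hypothetical, and it assumes that a task at exactly 95% counts as passed, since that is the reading under which the MemNN column yields 16 passed tasks.

```python
# Per-task test accuracies in task order 1..20, copied from Table 1.
memnn = [100, 100, 100, 100, 98, 100, 85, 91, 100, 98,
         100, 100, 100, 99, 100, 100, 65, 95, 36, 100]
dmn = [100, 98.2, 95.2, 100, 99.3, 100, 96.9, 96.5, 100, 97.5,
       99.9, 100, 99.8, 100, 100, 99.4, 59.6, 95.3, 34.5, 100]


def mean(values):
    return sum(values) / len(values)


def tasks_passed(values, threshold=95.0, inclusive=True):
    """Count tasks at or above (inclusive) or strictly above the threshold."""
    if inclusive:
        return sum(1 for v in values if v >= threshold)
    return sum(1 for v in values if v > threshold)


memnn_mean = mean(memnn)   # about 93.35 from the (apparently rounded) entries
dmn_mean = mean(dmn)       # about 93.6

memnn_passed = tasks_passed(memnn)                          # 16
dmn_passed = tasks_passed(dmn)                              # 18
memnn_passed_strict = tasks_passed(memnn, inclusive=False)  # 15, task 18 sits at exactly 95
```

The DMN count is the same under both readings of the threshold, while the MemNN count depends on how task 18 (exactly 95) is treated.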


4.1. Question Answering 


The Facebook bAbI dataset is a synthetic dataset for testing a model’s
ability to retrieve facts and reason over them. Each task tests a
different skill that a question answering model ought to have, such as
coreference resolution, deduction, and induction. Showing an ability
exists here is not sufficient to conclude a model would also exhibit it
on real world text data. It is, however, a necessary condition.

Training on the bAbI dataset uses the following objective function:
$J = \alpha E_{CE}(\text{Gates}) + \beta E_{CE}(\text{Answers})$, where
$E_{CE}$ is the standard cross-entropy cost and $\alpha$ and $\beta$
are hyperparameters. In practice, we begin training with $\alpha$ set
to 1 and $\beta$ set to 0, and then later switch $\beta$ to 1 while
keeping $\alpha$ at 1. As described in Section 2.1, the input module
outputs fact representations by taking the encoder hidden states at
time steps corresponding to the end-of-sentence tokens. The gate
supervision aims to select one sentence per pass; thus, we also
experimented with modifying Eq. 8 to a simple softmax instead of a GRU.
Here, we compute the final episode vector via
$e^i = \sum_{t=1}^{T_C} \mathrm{softmax}(g_t^i) \, c_t$, where
$\mathrm{softmax}(g_t^i) = \frac{\exp(g_t^i)}{\sum_{j=1}^{T_C} \exp(g_j^i)}$,
and $g_t^i$ here is the value of the gate before the sigmoid. This
setting achieves better results, likely because the softmax encourages
sparsity and is better suited to picking one sentence at a time.
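The softmax variant of the episode and the two-phase weighting of the objective can be sketched as below. This is an illustrative sketch rather than the training code: the function names are hypothetical, and `switch_step` (the point at which the answer loss is turned on) is an assumed parameter, since the text only says the switch happens "later".

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of pre-sigmoid gate values."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_episode(facts, pre_sigmoid_gates):
    """Episode as the softmax-weighted sum of the fact vectors."""
    weights = softmax(pre_sigmoid_gates)
    dim = len(facts[0])
    return [sum(w * fact[k] for w, fact in zip(weights, facts)) for k in range(dim)]


def loss_weights(step, switch_step):
    """Return (alpha, beta): gate loss only at first, then both losses."""
    return (1.0, 0.0) if step < switch_step else (1.0, 1.0)


def objective(gate_loss, answer_loss, alpha, beta):
    """J = alpha * E_CE(Gates) + beta * E_CE(Answers)."""
    return alpha * gate_loss + beta * answer_loss


facts = [[1.0, 0.0], [0.0, 1.0]]
uniform_episode = softmax_episode(facts, [0.0, 0.0])    # equal gates average the facts
peaked_episode = softmax_episode(facts, [10.0, -10.0])  # close to the first fact
```

When one gate value is much larger than the others, the episode is nearly a copy of a single fact, which matches the stated goal of selecting one sentence per pass.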


Task        Binary    Fine-grained
MV-RNN      82.9      44.4
RNTN        85.4      45.7
DCNN        86.8      48.5
PVec        87.8      48.7
CNN-MC      88.1      47.4
DRNN        86.6      49.8
CT-LSTM     88.0      51.0
DMN         88.6      52.1

Table 2. Test accuracies for sentiment analysis on the Stanford
Sentiment Treebank. MV-RNN and RNTN: Socher et al. (2013). DCNN:
Kalchbrenner et al. (2014). PVec: Le & Mikolov (2014). CNN-MC: Kim
(2014). DRNN: Irsoy & Cardie (2015). CT-LSTM: Tai et al. (2015).


We list results in Table 1. The DMN does worse than the Memory
Network, which we refer to from here on as MemNN, on tasks 2 and 3,
both tasks with long input sequences. We suspect that this is due to
the recurrent input sequence model having trouble modeling very long
inputs. The MemNN does not suffer from this problem as it views each
sentence separately. The power of the episodic memory module is evident
in tasks 7 and 8, where the DMN significantly outperforms the MemNN.
Both tasks require the model to iteratively retrieve facts and store
them in a representation that slowly incorporates more of the relevant
information of the input sequence. Both models do poorly on tasks 17
and 19, though the MemNN does better. We suspect this is due to the
MemNN using n-gram vectors and sequence position features.


4.2. Text Classification: Sentiment Analysis 


The Stanford Sentiment Treebank (SST) (Socher et al., 2013) is a
popular dataset for sentiment classification. It provides phrase-level
fine-grained labels, and comes with a train/development/test split. We
present results on two formats: fine-grained root prediction, where all
full sentences (root nodes) of the test set are to be classified as
either very negative, negative, neutral, positive, or very positive,
and binary root prediction, where all non-neutral full sentences of the
test set are to be classified as either positive or negative. To train
the model, we use all full sentences as well as subsample 50% of
phrase-level labels every epoch. During evaluation, the model is only
evaluated on the full sentences (root setup). In binary classification,
neutral phrases are removed from the dataset. The DMN achieves
state-of-the-art accuracy on the binary classification task, as well as
on the fine-grained classification task.


In all experiments, the DMN was trained with GRU sequence models. It is
easy to replace the GRU sequence model with any of the models listed
above, as well as incorporate tree structure in the retrieval process.

Model               Acc (%)
SVMTool             97.15
Søgaard             97.27
Suzuki et al.       97.40
Spoustova et al.    97.44
SCNN                97.50
DMN                 97.56

Table 3. Test accuracies on WSJ-PTB


4.3. Sequence Tagging: Part-of-Speech Tagging 


Part-of-speech tagging is traditionally modeled as a sequence tagging
problem: every word in a sentence is to be classified into its
part-of-speech class (see Fig. 1). We evaluate on the standard Wall
Street Journal dataset (Marcus et al., 1993). We use the standard
splits of sections 0-18 for training, 19-21 for development and 22-24
for test sets (Søgaard, 2011). Since this is a word level tagging task,
DMN memories are classified at each time step corresponding to each
word. This is described in detail in Section 2.4’s discussion of
sequence modeling.

We compare the DMN with the results in (Søgaard, 2011). The DMN
achieves state-of-the-art accuracy with a single model, reaching a
development set accuracy of 97.5. Ensembling the top 4 development
models, the DMN gets to 97.58 dev and 97.56 test accuracies, achieving
a slightly higher new state-of-the-art (Table 3).


4.4. Quantitative Analysis of Episodic Memory Module 

The main novelty of the DMN architecture is in its episodic memory
module. Hence, we analyze how important the episodic memory module is
for NLP tasks and in particular how the number of passes over the input
affects accuracy.

Table 4 shows the accuracies on a subset of bAbI tasks as well as on
the Stanford Sentiment Treebank. We note that for several of the hard
reasoning tasks, multiple passes over the inputs are crucial to
achieving high performance. For sentiment the differences are smaller.
However, two passes outperform a single pass or zero passes. In the
latter case, there is no episodic memory at all and outputs are passed
directly from the input module to the answer module. We note that
especially complicated examples are more often correctly classified
with 2 passes, but many examples in sentiment contain only simple
sentiment words and no negation or misleading expressions. Hence the
need to have a complicated architecture for them is small. The same is
true for POS tagging. Here, differences in accuracy are less than 0.1
between different numbers of passes.

Next, we show that the additional correct classifications are hard
examples with mixed positive/negative vocabulary.

Max       task 3         task 7    task 8       sentiment
passes    three-facts    count     lists/sets   (fine grain)
0 pass    0              48.8      33.6         50.0
1 pass    0              48.8      54.0         51.5
2 pass    16.7           49.1      55.6         52.1
3 pass    64.7           83.4      83.4         50.1
5 pass    95.2           96.9      96.5         N/A

Table 4. Effectiveness of episodic memory module across tasks. Each
row shows the final accuracy in terms of percentages with a different
maximum limit for the number of passes the episodic memory module can
take. Note that for the 0-pass DMN, the network essentially reduces to
the output of the attention module.

4.5. Qualitative Analysis of Episodic Memory Module

Apart from a quantitative analysis, we also show qualitatively what
happens to the attention during multiple passes. We present specific
examples from the experiments to illustrate that the iterative nature
of the episodic memory module enables the model to focus on relevant
parts of the input. For instance, Table 5 shows an example of what the
DMN focuses on during each pass of a three-iteration scan on a question
from the bAbI dataset.

We also evaluate the episodic memory module for sentiment analysis.
Given that the DMN performs well with both one iteration and two
iterations, we study test examples where the one-iteration DMN is
incorrect and the two-episode DMN is correct. Looking at the sentences
in Fig. 4 and 5, we make the following observations:

1. The attention of the two-iteration DMN is generally much more
focused compared to that of the one-iteration DMN. We believe this is
due to the fact that with fewer iterations over the input, the hidden
states of the input module encoder have to capture more of the content
of adjacent time steps. Hence, the attention mechanism cannot only
focus on a few key time steps. Instead, it needs to pass all necessary
information to the answer module from a single pass.

2. During the second iteration of the two-iteration DMN, the attention
becomes significantly more focused on relevant key words and less
attention is paid to strong sentiment words that lose their sentiment
in context. This is exemplified by the sentence in Fig. 5 that includes
the very positive word “best.” In the first iteration, the word “best”
dominates the attention scores (darker color means larger score).
However, once its context, “is best described”, is clear, its relevance
is diminished and “lukewarm” becomes more important.


We conclude that the ability of the episodic memory module to perform
multiple passes over the data is beneficial. It provides significant
benefits on harder bAbI tasks, which require reasoning over several
pieces of information or transitive reasoning. Increasing the number of
passes also slightly improves the performance on sentiment analysis,
though the difference is not as significant. We did not attempt more
iterations for sentiment analysis as the model struggles with
overfitting with three passes.

Question: Where was Mary before the Bedroom?
Answer: Cinema.

Facts (columns: Episode 1, Episode 2, Episode 3)
Yesterday Julie traveled to the school.
Yesterday Marie went to the cinema.
This morning Julie traveled to the kitchen.
Bill went back to the cinema yesterday.
Mary went to the bedroom this morning.
Julie went back to the bedroom this afternoon.
[done reading]

Table 5. An example of what the DMN focuses on during each episode on
a real query in the bAbI task. Darker colors mean that the attention
weight is higher.

[Figure 4 panel: 2-iter DMN (pred: positive, ans: positive)]

Figure 4. Attention weights for sentiment examples that were only
labeled correctly by a DMN with two episodes. The y-axis shows the
episode number. This sentence demonstrates a case where the ability to
iterate allows the DMN to sharply focus on relevant words.

[Figure 5 panels: 2-iter DMN (pred: negative, ans: negative); 2-iter DMN (pred: negative, ans: negative)]

Figure 5. These sentences demonstrate cases where initially positive
words lost their importance after the entire sentence context became
clear either through a contrastive conjunction (“but”) or a modified
action “best described.”


5. Conclusion 

The DMN model is a potentially general architecture for a variety of
NLP applications, including classification, question answering and
sequence modeling. A single architecture is a first step towards a
single joint model for multiple NLP problems. The DMN is trained
end-to-end with one, albeit complex, objective function. Future work
will explore additional tasks, larger multi-task models and multimodal
inputs and questions.